We are interested in the factors that potentially effect the revenue of each taxi trip. The total amount is the sum of fare amount, tax, tip amount, tolls amount and improvement surcharge. Tolls amount is fixed. Since tax and tip are dependent on fare amount, we are interested in fare amount.
According to a tipping guide that’s published by The New York Times in their Travel section back in 2006, the tipping rate is 15-20% for taxi drivers. So much has chnanged since then. Uber, lyft and other ride-hailing service companies have entered the market; paying has never become easier now that you can pay through an app on your phone and choose how much you want to tip. Even though most of time the tip percent will fall between 15% and 20% in NYC, it still varies across difference trips. We also hope to explore the reasons for higher percent tip.

The rule of standard metered taxi fare is too complicated . It is difficult for a passenger to estimate the fare amount refering to such complext instructions.
According to the codebook from the TLC website, the fare amount is defined as “The time-and-distance fare calculated by the meter”. Therefore, we hope to explore the association between fare amount and trip distance and duration. If possible, we also want to roughly predict the fare amount by duration and distance.
Data was queried from year 2018, covering all NYC boroughs. The duration is calculated from drop-off time and pick-up time. There might be data entry error since a large number of records with fare_amount of 52 have different distance and duration. We are also not interested in the trips with no time or a long duration, i.e. longer than 6 hours. The initial charge of NYC Yellow Cab is 2.5 dollars. Therefore, we exclude the data with duration longer than 360 or equals to 0, fare amount of 52 or less than 2.5.
pilot <-
taxi_2018 %>%
filter(
passenger_count != 0,
trip_distance != 0,
total_amount > 0,
fare_amount != 0
) %>%
mutate(tpep_pickup_datetime = as.character(tpep_pickup_datetime),
tpep_dropoff_datetime = as.character(tpep_dropoff_datetime)) %>%
separate(tpep_pickup_datetime, into = c("a","b"), sep = " ") %>%
separate(a, into = c("pick_year","pick_month","pu_date"), sep = "-") %>%
separate(b, into = c("pu_hour","pu_min"), sep = ":") %>%
separate(tpep_dropoff_datetime, into = c("c","d"),sep = " ") %>%
separate(c,into = c("drop_year","drop_month","do_date"),sep = "-") %>%
separate(d, into = c("do_hour","do_min"), sep = ":") %>%
dplyr::select(-pick_month,-drop_month,-pick_year,-drop_year) %>%
mutate(
month = as.integer(month),
pu_date = as.integer(pu_date),
pu_hour = as.integer(pu_hour),
pu_min = as.integer(pu_min),
do_date = as.integer(do_date),
do_hour = as.integer(do_hour),
do_min = as.integer(do_min)
) %>%
left_join(weather, by = c("month" = "month", "pu_date" = "day")) %>%
left_join(zone_lookup, by = c("pu_location_id" = "location_id")) %>%
rename(
pu_boro = borough,
pu_zone = zone
) %>%
left_join(zone_lookup, by = c("do_location_id" = "location_id")) %>%
rename(
do_boro = borough,
do_zone = zone
) %>%
dplyr::select(-pu_location_id, -do_location_id)
reg = pilot %>%
mutate(
duration = case_when(
pu_date != do_date ~ do_min - pu_min + (60 *(do_hour - pu_hour + 24)),
pu_date == do_date ~ do_min - pu_min + (60 *(do_hour - pu_hour) )
),
tip_percent = tip_amount/fare_amount
) %>%
filter(duration != 0,
fare_amount != 52,#There might be data entry error since all records that fare_amount is 52 have different distance and duration#
duration <= 360, #There are some points gathering with a long duration without long distance, which can be miscoding or influntial outliers, so we should exclude these data.#
fare_amount >= 2.5 #There can be data entry error since the initial charge of NYC taxi is $2.5, so we exclude those with fare amount < $2.5.#
)
set.seed(1)
reg_plot = reg %>%
sample_frac(size = 0.1, replace = FALSE)
fare_distance = reg_plot %>%
ggplot(aes(x = trip_distance, y = fare_amount)) +
geom_point(alpha = 0.3, color = 'skyblue',size = 0.3) +
labs(
title = "The distribution of fare amount by trip distance and duration",
x = "Trip Distance",
y = "Fare Amount"
)
fare_duration = reg_plot %>%
ggplot(aes(x = duration, y = fare_amount)) +
geom_point(alpha = 0.3, color = 'red', size = 0.3) +
labs(
x = "Trip Duration",
y = "Fare Amount"
)
fare_distance + fare_duration

According to the scatterplots, there is a positive relationship among trip distance and fare amount. Trip duration is also positively associated with fare amount.
The outcome of linear regression model is fare amount, and the predictors include trip distance and duration.
fit1 = lm(fare_amount ~ duration + trip_distance , data = reg) %>%
broom::tidy() %>%
janitor::clean_names() %>%
mutate(p = round(p_value,3))
fit1 = lm(fare_amount ~ duration + trip_distance , data = reg )
summary(fit1) %>% broom::tidy() %>% knitr::kable(digit = 2)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2.02 | 0 | 503.87 | 0 |
| duration | 0.35 | 0 | 1018.08 | 0 |
| trip_distance | 2.00 | 0 | 1735.95 | 0 |
summary(fit1) %>% broom::glance() %>% knitr::kable(digit = 2)
| r.squared | adj.r.squared | sigma | statistic | p.value | df |
|---|---|---|---|---|---|
| 0.93 | 0.93 | 2.58 | 7282469 | 0 | 3 |
The p-value of the two coefficients are both < 0.05. It is seen that both variables (duration, trip_distance) are statistically significant at 5 % level of significance.
Further we can plot the model diagnostic checking for other problems such as normality of error term, heteroscedasticity etc.
par(mfrow = c(2,2))
plot(fit1)

The plots indicate a lot of outliers, which make the model violate the linear regression assumptions.
It can be the result of the skewed distribution of fare amount.
reg_plot %>%
ggplot(aes(x = fare_amount)) +
geom_histogram(bins = 100) +
labs(
title = "Distribution of fare amount",
x = "Fare Amount",
y = " "
)

In addition, we realized that the duration will increase if the trip distance increased. Therefore, possibly there is multicollinearity problem in this model as well. For further diagnosis of the problem, let us first look at the pair-wise correlation among the explanatory variables.
corr = reg %>% dplyr::select(trip_distance, duration)
pcor(corr, method = "pearson")
## $estimate
## trip_distance duration
## trip_distance 1.0000000 0.7369451
## duration 0.7369451 1.0000000
##
## $p.value
## trip_distance duration
## trip_distance 0 0
## duration 0 0
##
## $statistic
## trip_distance duration
## trip_distance 0.000 1168.769
## duration 1168.769 0.000
##
## $n
## [1] 1149265
##
## $gp
## [1] 0
##
## $method
## [1] "pearson"
The correlation between trip distance and duration is 0.74. The p-value is smaller than 0.05, therefore, at 5% level of significance, trip distance and duration is related to each other.
imcdiag(corr, reg$fare_amount)
##
## Call:
## imcdiag(x = corr, y = reg$fare_amount)
##
##
## All Individual Multicollinearity Diagnostics Result
##
## VIF TOL Wi Fi Leamer CVIF Klein IND1 IND2
## trip_distance 2.1886 0.4569 1366021 Inf 0.676 -0.2686 0 0 1
## duration 2.1886 0.4569 1366021 Inf 0.676 -0.2686 0 0 1
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## * all coefficients have significant t-ratios
##
## R-square of y on all x: 0.9269
##
## * use method argument to check which regressors may be the reason of collinearity
## ===================================
The VIF, TOL and Wi columns provide the diagnostic output for variance inflation factor, tolerance and Farrar-Glauber F-test respectively. Since VIF<10 and TOL>0.1, there is no multicolinearity in the regression model.
In addition ot fare amount, we are also interested in the factors that may effect tip percent in each trip.
tip_distance = reg_plot %>%
filter(payment_type == 1) %>%
ggplot(aes(x = trip_distance, y = tip_percent)) +
geom_point(alpha = 0.3, color = 'skyblue', size = 0.1,na.rm = TRUE) +
labs(
title = "Tip percent by Trip Distance, Passenger Count, Precipitation",
x = "Trip Distance",
y = "Tip Percent"
) +
scale_y_continuous(
limits = c(0,1)
)
tip_passenger = reg_plot %>%
filter(payment_type == 1) %>%
ggplot(aes(x = passenger_count, y = tip_percent)) +
geom_point(alpha = 0.3, color = 'palevioletred2', size = 0.1,na.rm = TRUE) +
labs(
x = "Passenger Count",
y = "Tip Percent"
) +
scale_y_continuous(
limits = c(0,1)
)
tip_prcp = reg_plot %>%
filter(payment_type == 1) %>%
ggplot(aes(x = prcp, y = tip_percent)) +
geom_point(alpha = 0.3, color = 'palegreen3', size = 0.1,na.rm = TRUE) +
labs(
x = "Precipitation",
y = "Tip Percent"
) +
scale_y_continuous(
limits = c(0,1)
)
tip_distance + tip_passenger + tip_prcp

There is no clear pattern among tip percent and the three predictors. Therefore, we will not conduct regression analysis about tip percent.
Instead, we’ll look at the cab-tipping scence is like in Manhattan, where yellow cabs are mostly concentrated.
We look at tip percentage, which we generated by dividing the tip amount by the total fare amount.
Since only tips made by credit card are recorded, we’ll only be looking at rides that are paid with a credit card.
We look at rides that are either picked-up at and dropped-off at a neighborhood in Manhattan.
# plotly for do_zone
f1 <- list(
family = "Arial, sans-serif",
size = 18,
color = "grey"
)
f2 <- list(
family = "Old Standard TT, serif",
size = 14,
color = "black"
)
a <- list(
title = "Tip Percentage",
titlefont = f1,
showticklabels = TRUE,
tickangle = 45,
tickfont = f2,
exponentformat = "E",
range = c(0,1)
)
b <- list(
title = "Drop off Manhattan Neighborhoods",
titlefont = f1,
showticklabels = TRUE,
tickangle = 45,
tickfont = f2,
exponntformat = "E"
)
c <- list(
title = "Pick up Manhattan Neighborhoods",
titlefont = f1,
showticklabels = TRUE,
tickangle = 45,
tickfont = f2,
exponntformat = "E"
)
set.seed(1)
tip_plot %>%
sample_frac(size = 0.1, replace = FALSE) %>%
mutate(do_zone = forcats::fct_reorder(do_zone, tip_percent),
pu_zone = forcats::fct_reorder(pu_zone, tip_percent),
do_median = median(tip_percent),
text = str_c("Drop Off Zone: ", do_zone , "\nTip Percent: ", tip_percent, "\nMedian Percentage:", do_median) ) %>%
drop_na(do_zone, pu_zone) %>%
plot_ly(y = ~tip_percent, color = ~do_zone, text = ~text, type = "box", colors = "Set2", marker = list(size = 2)) %>%
layout(xaxis = b, yaxis = a)
We can see that of all 64 drop-off neighborhoods, Highbridge Park has the lowest median tip percentage at 10.73%, wehre Randalls Island has the highest median tip percentage at 25.46%. Overall, we do see that if the lower the cab go into Manhattan, the higher the percentage of tip is given.
set.seed(1)
tip_plot %>%
sample_frac(size = 0.1, replace = FALSE) %>%
mutate(do_zone = forcats::fct_reorder(do_zone, tip_percent),
pu_zone = forcats::fct_reorder(pu_zone, tip_percent),
pu_median = median(tip_percent),
text = str_c("Pick Up Zone: ", pu_zone , "\nTip Percent: ",
tip_percent, "\nMedian Percentage:", pu_median) ) %>%
drop_na(do_zone, pu_zone) %>%
plot_ly(y = ~tip_percent, color = ~pu_zone, text = ~text, type = "box", colors = "Set2", marker = list(size = 2)) %>%
layout(xaxis = c, yaxis = a)